LS4003 R worksheet 1
Penguins laying eggs
Introduction to the data
For this worksheet, you will need the penguins.csv file from the Canvas page.
These data come from a study of penguin couples. Once a pair had laid an egg, both parents were captured and measured. The pair was then tracked to see whether they went on to lay a second egg, giving a “Full Clutch”.
This dataset contains the following information:
| Column | Data |
|---|---|
| SampleNumber | Unique number for every penguin |
| LatinName | Species name, full Latin name |
| CommonName | Common name for the penguin species |
| FullClutch | Whether or not the penguin couple laid two eggs |
| FlipperLength | Length of the penguin’s flipper in millimeters |
| BodyMass | Mass of the penguin in grams |
| Sex | Whether the penguin is male or female |
The dataset covers penguins of three species, identified in the LatinName and CommonName columns.
The tasks
Task 1: Read the data
To start off, you will want to:
- Read the csv file into R as a dataframe
- Use summary() to get an overview of the data
- Calculate the mean and standard deviation of the numerical columns
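
As a rough sketch of these three steps, the example below reads a tiny made-up table instead of the real file, so it runs on its own. The four rows, the species names "Species A" and "Species B", and the "Yes"/"No" and "Male"/"Female" codings are all invented for illustration; your penguins.csv may code these columns differently. The variable name penguin_df matches the one used later in this worksheet.

```r
# Toy stand-in for penguins.csv: these rows are invented for illustration only.
csv_text <- "SampleNumber,CommonName,FullClutch,FlipperLength,BodyMass,Sex
1,Species A,Yes,180,3500,Male
2,Species A,No,190,3700,Female
3,Species B,Yes,200,4500,Male
4,Species B,Yes,210,4700,Female"

# With the real file (saved in your working directory) you would write:
#   penguin_df <- read.csv("penguins.csv")
penguin_df <- read.csv(text = csv_text)

# Overview of every column
summary(penguin_df)

# Mean and standard deviation of the numerical measurement columns.
# na.rm = TRUE skips missing values, in case the real data contains any.
mean_flipper <- mean(penguin_df$FlipperLength, na.rm = TRUE)
sd_flipper   <- sd(penguin_df$FlipperLength, na.rm = TRUE)
mean_mass    <- mean(penguin_df$BodyMass, na.rm = TRUE)
sd_mass      <- sd(penguin_df$BodyMass, na.rm = TRUE)
```

SampleNumber is also stored as a number, but it is only an identifier, so its mean and standard deviation are not meaningful.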
Task 2: Simple boxplots
Now that you have read the data into R, you can use ggplot2 to draw boxplots.
Make boxplots of:
- FlipperLength, separated by species
- BodyMass, separated by species
Which value do you want on the x-axis? What about the y-axis?
To colour by species, try using the fill = option in aes().
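
One possible way to set this up is sketched below, using a small invented data frame so that the code runs without the CSV file. The species names and flipper lengths are placeholders, not real measurements, and the sketch assumes the species is taken from the CommonName column.

```r
library(ggplot2)

# Toy stand-in for the worksheet data: all values below are invented.
penguin_df <- data.frame(
  CommonName    = rep(c("Species A", "Species B", "Species C"), each = 4),
  FlipperLength = c(181, 186, 190, 195, 196, 198, 201, 205, 210, 215, 220, 225)
)

# Category on the x-axis, measurement on the y-axis;
# fill = CommonName colours each box by species.
flipper_boxplot <- ggplot(
  penguin_df,
  aes(x = CommonName, y = FlipperLength, fill = CommonName)
) +
  geom_boxplot()

# Typing the object's name draws the plot
flipper_boxplot
```

Swapping FlipperLength for BodyMass in aes() would give the second plot in the same way.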
A boxplot of FlipperLength separated by species should look like this:
Task 3: Split boxplots by multiple categories
In the last example we used the same column of data for the x axis and the fill colour. If we change one of these, we can visualise the data further.
Make boxplots of:
- FlipperLength, separated by species and Sex
- FlipperLength, separated by species and FullClutch
- BodyMass, separated by species and Sex
- BodyMass, separated by species and FullClutch
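
A minimal sketch of the first of these plots is shown below. It keeps species on the x-axis and uses a different column, Sex, for the fill colour. The data frame is invented so the example is self-contained, and the "Male"/"Female" coding is an assumption about how the real file stores this column.

```r
library(ggplot2)

# Toy stand-in for the worksheet data: all values below are invented.
penguin_df <- data.frame(
  CommonName    = rep(c("Species A", "Species B", "Species C"), each = 4),
  Sex           = rep(c("Male", "Female"), times = 6),
  FlipperLength = c(181, 186, 190, 195, 196, 198, 201, 205, 210, 215, 220, 225)
)

# x separates the boxes by species; fill splits each species by sex,
# so ggplot2 draws one box per species/sex combination side by side.
flipper_by_sex <- ggplot(
  penguin_df,
  aes(x = CommonName, y = FlipperLength, fill = Sex)
) +
  geom_boxplot()

flipper_by_sex
```

Replacing Sex with FullClutch, or FlipperLength with BodyMass, should cover the other three plots.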
Task 4: Visualise distributions of FlipperLength and BodyMass
Next, use a histogram to visualise the distributions of sizes. You may want to use the fill = option to separate by one of the categories.
Make histograms of:
- FlipperLength, separated by species
- BodyMass, separated by species
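
The sketch below shows one way to draw such a histogram, again on invented data so it runs on its own. The binwidth of 5, and the choice of position = "identity" with alpha = 0.6 to overlay semi-transparent histograms, are illustrative choices rather than requirements of the task.

```r
library(ggplot2)

# Toy stand-in for the worksheet data: all values below are invented.
penguin_df <- data.frame(
  CommonName    = rep(c("Species A", "Species B", "Species C"), each = 4),
  FlipperLength = c(181, 186, 190, 195, 196, 198, 201, 205, 210, 215, 220, 225)
)

# A histogram only needs an x aesthetic; the counts on the y-axis are
# computed by geom_histogram(). fill = CommonName gives each species a colour.
flipper_hist <- ggplot(penguin_df, aes(x = FlipperLength, fill = CommonName)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6)

flipper_hist
```

Without position = "identity", ggplot2 stacks the bars of the different species on top of each other, which can make the individual distributions harder to compare.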
Task 5: Use filter to plot results for individual species
- Using the filter() function, extract the values of just one species of penguin and save these in a new dataframe.
- Repeat as in Task 4, but separate the distributions by Sex or FullClutch
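
As a sketch of both steps, the example below assumes that filter() here means the dplyr function and that species are selected through the CommonName column. The data frame, the species name "Species A", and the new data frame name one_species_df are all invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the worksheet data: all values below are invented.
penguin_df <- data.frame(
  CommonName    = rep(c("Species A", "Species B", "Species C"), each = 4),
  Sex           = rep(c("Male", "Female"), times = 6),
  FlipperLength = c(181, 186, 190, 195, 196, 198, 201, 205, 210, 215, 220, 225)
)

# Keep only the rows for one species and store them in a new data frame.
# Note the double == for comparison.
one_species_df <- filter(penguin_df, CommonName == "Species A")

# Same histogram as before, but on the filtered data and filled by Sex
one_species_hist <- ggplot(one_species_df, aes(x = FlipperLength, fill = Sex)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6)

one_species_hist
```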
Extension: Larger penguin dataset
If you finish all of the above and would like to challenge your skills, download the Penguins_extension.csv from the Canvas page.
This is the same dataset but with more columns:
| Column | Data |
|---|---|
| SampleNumber | Unique number for every penguin |
| LatinName | Species name, full Latin name |
| CommonName | Common name for the penguin species |
| Region | Region of Antarctica the penguin is in |
| Island | Island in Antarctica where the penguin’s nest is located |
| FullClutch | Whether or not the penguin couple laid two eggs |
| EggDate | Date at which the first egg was laid |
| CulmenLength | Length of the penguin’s culmen in millimeters |
| CulmenDepth | Depth of the penguin’s culmen in millimeters |
| FlipperLength | Length of the penguin’s flipper in millimeters |
| BodyMass | Mass of the penguin in grams |
| Sex | Whether the penguin is male or female |
One column of this dataset, EggDate, contains dates. By default these will be treated as strings, that is, as text that R sorts alphabetically rather than chronologically.
This means that dates won’t be plotted in order from oldest to newest, which is not very useful.
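
To see the problem on a small scale, the sketch below sorts three made-up dates first as strings and then as Date values. The dates are invented, and the day/month/two-digit-year layout is assumed to match the format string used in this worksheet.

```r
# Three invented dates written as day/month/two-digit year
egg_dates <- c("15/11/07", "01/12/07", "03/11/08")

# As strings they are sorted character by character, so the day number
# decides the order and the years end up mixed together.
sort(egg_dates)

# After conversion R knows these are dates and sorts them chronologically.
parsed_dates <- as.Date(egg_dates, format = "%d/%m/%y")
sort(parsed_dates)
```

In the format string, %d is the day of the month, %m is the month as a number, and %y is the year without the century.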
To convert our dates into a “Date” class - to tell R that these are dates - we can use the following code:

```r
penguin_df$EggDate <- as.Date(penguin_df$EggDate, format = "%d/%m/%y")
```

At this point, it’s up to you what you want to do - explore the data! What can you find out? Are there any patterns emerging?
This is the end of the worksheet. Don’t worry if it still seems a bit alien - we have four more sessions after the winter break.